A hospital administrator wished to study the relation between patient satisfaction (a measured index score between 0 and 100) and patient's age (measured in years), severity of illness (an index), anxiety level (an index), and sex (coded as 0=male, 1=female). Larger values of the index variables are associated with more satisfaction, greater severity of illness, and more anxiety, respectively. The administrator randomly selected 46 patients and collected the data. The data appear in the text file patientsatisfaction.txt.
Read in the data and take a quick look:
patsatdat <- read.table("patientsatisfaction.txt", header=TRUE)
kable(head(patsatdat))
| satis.index | age | ill.sev.index | anx.index | sex |
|---|---|---|---|---|
| 48 | 50 | 51 | 2.3 | 0 |
| 57 | 36 | 46 | 2.3 | 0 |
| 66 | 40 | 48 | 2.2 | 0 |
| 70 | 41 | 44 | 1.8 | 0 |
| 89 | 28 | 43 | 1.8 | 1 |
| 36 | 49 | 54 | 2.9 | 0 |
What is different about this compared to the problems we were seeing in Chapters 2-4?
sex, coded here as an indicator (or dummy) variable. All of these are potential predictors of the response variable (the patient satisfaction index). Statisticians use regression to build models that investigate the nature of such relationships among variables. When multiple predictors are considered simultaneously, the technique is known as multiple regression modeling.
The goals of an analysis of these data are to:
The first item in the list above highlights the importance of knowing the form of the regression model being fitted, because we may try several different models for a given data set. The basic form of a main effects multiple linear regression (MLR) model with \(k\) predictors is:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \varepsilon\]
In the present context, this model is given by
\[Patient Satis Score = \beta_0 + \beta_1 (Age) + \beta_2 (Illness Severity Index) + \beta_3 (Anxiety Index) + \beta_4 (Sex) + \varepsilon\]
While it is still important to observe the distribution of the individual variables in the study (like we have done in prior EDAs), the main focus of a regression analysis is to study associations between variables. This can be effectively done via a scatterplot matrix (available in GGally with the ggscatmat() function), which displays the pairwise associations that exist in the sample data, along with Pearson correlation coefficients:
ggscatmat(patsatdat)
Correlation between two variables measures the amount of linear relationship between the two variables. A correlation is typically denoted with an \(r\) (e.g., \(r=-0.79\) for the satis.index and age variables). As a reminder, any correlation is bounded between \(-1\) and \(1\), and its distance from 0 indicates the strength of the bivariate linear relationship, with the sign giving the direction (thus age and satis.index have a moderately strong negative relationship).
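The correlations shown in the scatterplot matrix can also be computed directly. A quick sketch, assuming patsatdat has been read in as above:

```r
# Pairwise Pearson correlations among all variables in the data frame
round(cor(patsatdat), 2)

# A single correlation, e.g., satisfaction vs. age
cor(patsatdat$satis.index, patsatdat$age)
```

The full correlation matrix is symmetric, so the values above the diagonal mirror those below it, matching the coefficients ggscatmat() prints in its upper panels.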
What are some things to notice in the above EDA?
Over the coming weeks we will answer all of these (and more!) questions by covering various statistical methods.
Now, how do we determine how to fit a model of the form above to these data? For that, we consider a simple dataset with just one predictor variable…
Manatees are large, gentle sea creatures that live along the Florida coast. The number of manatees killed by motorboats in the state of Florida has continually been monitored since 1974 by the Florida Fish and Wildlife Conservation Commission. The number of motorboats registered within the state since 1977 is available from the state's Department of Motor Vehicles. The original data are attributed to Elementary Statistics, 9th edition by Mario Triola but have since been updated. The data are in the file manatee.csv on the Canvas site.
Below is a scatterplot of the raw data. Note our creativity in choice of a plotting symbol:
Visually it looks as if there is an upward trending line (more boats registered, more manatees killed). Suppose we decide to fit the following model to the data:
\[Y = \beta_0 + \beta_1 X + \varepsilon\]
where \(Y\) is the number of manatees killed and \(X\) is the number of boats registered in a given year.
How do we estimate the coefficients \(\beta_0\) and \(\beta_1\)? Let’s consider different line “fits” to the data in the following interactive display. You may also see details of each data point by hovering your cursor over each point:
What choice appears to be the best? It’s hard to say. For a bit more focus, let’s consider just two specific competing fitted lines:
The plot above includes a depiction of the model “error” for each point for each of the two fitted lines, calculated from the observed number of Manatees Killed as compared to the fitted line (which produces a model-based predicted number of Manatees Killed). Ideally, we want a line with smaller errors overall. We find the best such model using the method of least squares, where we minimize the quantity
\[\begin{eqnarray*} RSS &=& \sum_{i=1}^n \left(\textrm{Error at point } i\right)^2 \\ &=& \sum_{i=1}^n \left(e_i\right)^2 \\ &=& \sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2 \\ &=& \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right)^2 \end{eqnarray*}\]
For the two models fitted in the display above, the RSS for each is
| Group | RSS |
|---|---|
| Model 1 | 4398.861 |
| Model 2 | 3222.810 |
So Model 2 is a better fit than Model 1 (you can see that visually as well) … but is it the best?
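To make the RSS computation concrete, here is a small helper function that evaluates the RSS for any candidate line. This is a sketch: it assumes the manatee data frame has columns BoatsRegistered and ManateesKilled, and the candidate coefficients shown are purely illustrative, not the values behind Models 1 and 2 above.

```r
# RSS for a candidate line: Y-hat = b0 + b1 * X
rss <- function(b0, b1, x = manatee$BoatsRegistered, y = manatee$ManateesKilled) {
  yhat <- b0 + b1 * x    # model-based predictions
  sum((y - yhat)^2)      # sum of squared errors
}

# Compare any two candidate lines (coefficients here are illustrative)
rss(b0 = -40, b1 = 0.12)
rss(b0 = -45, b1 = 0.14)
```

Trying candidate values by hand like this quickly shows that some lines fit better than others, but a trial-and-error search can never guarantee we have found the minimum; that is what the calculus-based solution below provides.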
To determine the best fitting model overall, we need to use calculus with the above RSS function and find the optimal values of \(b_0\) and \(b_1\). In R, we do this with the lm() function. Consider the following:
lm(ManateesKilled ~ BoatsRegistered, data=manatee)
##
## Call:
## lm(formula = ManateesKilled ~ BoatsRegistered, data = manatee)
##
## Coefficients:
## (Intercept) BoatsRegistered
## -43.5956 0.1326
Thus, for each additional boat registered, the expected number of manatees killed by motorboats increases by 0.1326. Put another way, for every 100 additional boats registered, we expect approximately 13 more manatees on average to be killed by motorboats.
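Once the model is fitted, predict() evaluates the fitted line at any registration count. A sketch: the saved model object name is our own, and the value 700 is purely illustrative and must be on whatever scale BoatsRegistered is recorded in within manatee.csv.

```r
manatee.fit <- lm(ManateesKilled ~ BoatsRegistered, data = manatee)

# Predicted mean number of manatees killed at an illustrative
# registration count (same units as the BoatsRegistered column)
predict(manatee.fit, newdata = data.frame(BoatsRegistered = 700))
```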
Note the intercept here is conceptually meaningless: it corresponds to the expected number of manatees killed by motorboats when zero motorboats are registered in the state of Florida. (This is a common phenomenon of regression – the intercept often tends to have no contextual interpretation.)
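The calculus mentioned above yields closed-form solutions: \(b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}\) and \(b_0 = \bar{Y} - b_1 \bar{X}\). We can verify these against lm() directly (a sketch, again assuming the manatee data frame as above):

```r
x <- manatee$BoatsRegistered
y <- manatee$ManateesKilled

b1 <- cov(x, y) / var(x)       # slope: Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)   # the line passes through (x-bar, y-bar)

c(b0, b1)  # should match the coefficients reported by lm()
```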
We will fit the multiple regression model
\[Patient Satis Score = \beta_0 + \beta_1 (Age) + \beta_2 (Illness Severity Index) + \beta_3 (Anxiety Index) + \beta_4 (Sex) + \varepsilon\]
to the data using R and the function lm. For now, we will forgo checking the assumptions (we'll see this very soon!) because the validity of the assumptions underlying the model mainly pertains to using the model for statistical inference purposes (Chapter 6).
mr.model1 <- lm(satis.index ~ age + ill.sev.index + anx.index + sex, data=patsatdat)
summary(mr.model1)
##
## Call:
## lm(formula = satis.index ~ age + ill.sev.index + anx.index +
## sex, data = patsatdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.347 -6.417 0.520 8.374 17.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158.49433 18.36268 8.631 9.16e-11 ***
## age -1.14173 0.21951 -5.201 5.86e-06 ***
## ill.sev.index -0.44201 0.49793 -0.888 0.3799
## anx.index -13.46700 7.23154 -1.862 0.0697 .
## sex -0.01183 3.04192 -0.004 0.9969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.18 on 41 degrees of freedom
## Multiple R-squared: 0.6822, Adjusted R-squared: 0.6512
## F-statistic: 22 on 4 and 41 DF, p-value: 9.332e-10
Several interpretive things to take note of here:
Multiple R-squared in the output, here 0.6822. So about 68% of the variance in the response can be attributed to the modeled relationship with age, illness severity, anxiety level, and sex. (Obviously, since \(R^2\) is a proportion, it must be between 0 and 1. Higher values are desirable, but there is a big caveat with that which we will discuss soon.) In the text, we discuss the practical challenges in interpreting the \(\beta\)-parameter estimates when the model is fit to observational data. Nonetheless, here is an interpretation of them in the present context. We will discuss in class what the practical challenges are, however:
sex was coded as a 0/1 dummy variable in the data, with 0 = male and 1 = female. So actually, the variable sex is an indicator of being female. Thus, a “one-point increase” on sex amounts to changing from male to female. So, when all other predictors are held fixed, the predicted mean patient satisfaction score is about 0.01 points lower for females as compared to males. Now, what are some of the issues with practical interpretation of these parameter estimates?
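Individual pieces of the fitted model can be pulled out programmatically, which is handy when reporting these interpretations. A sketch assuming the mr.model1 object fitted above:

```r
coef(mr.model1)               # the estimated coefficients (the b values)
summary(mr.model1)$r.squared  # the multiple R-squared (0.6822 here)
confint(mr.model1)            # 95% confidence intervals for each beta
```

The confidence intervals foreshadow the inference topics of Chapter 6: predictors whose intervals contain 0 (such as sex here) are those with large p-values in the summary table.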